16S_Caulosphere_Microbiome2017_FA19

R Packages

The following code uses several packages including vegan for community ecology, tidyr and ply for organizing data tables and manipulating strings of text, and ggplot for plotting data.

Project Summary

The following code is used to analyze 16S data from drill shaving samples collected from phloem tissues of 47 Juglans nigra trees. Raw reads were generated on the Illumina Miseq platform (2x250) and processed into OTUs using the bioinformatics pipeline Mothur. Following sequence processing in Mothur, a total of 4,530 OTUs were generated from 18,922 unique sequences (333,951 sequences total from 5,628,726 reads prior to processing).Taxonomic assignments were made in Mothur using the SILVA database. OTUs were generated in MOTHUR using a 97% similarity threshold for sequences aligned to the V4 region of the 16S rRNA region.

Data Import

Below raw OTU tables (data has not yet been rarefied), metadata which indicates the state from which the sample was collected and clone ID, and the taxonomic classifications for each OTU are imported into R.

#First I import the whole OTU table I recieved from Mothur. Note at this point taxonomy is not included, it is just raw OTU counts (columns) by sample or group (rows). Below I indicate that the row.names are in column 2 of the imported file. 
bigfile.sample.ds<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/barks16sstatistics_jnigra17_2019_ajo/bark16smothur_jnigra17_microbiome/bark16sotutable_jnigra17_fa19_ajo.shared",header=TRUE,sep="\t",fill=TRUE,strip.white=TRUE, row.names=2)

#Next I import the metadata for the particular habitat of interest. In this case I am working with drill shavings. I will use this metadata to subset the larger OTU file above. 
ds.met<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/barks16sstatistics_jnigra17_2019_ajo/bark16smothur_jnigra17_microbiome/bark16smetadata_jnigra17_su19_ajo.txt",header=TRUE,sep="\t",fill=TRUE,strip.white=TRUE)

#Next I import the taxonomy file. In earlier versions of this code, this contained taxonomic assignments for the whole dataset and not just a single habitat. Current version is only 16S drill shaving data.  
taxonomy.ds<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/barks16sstatistics_jnigra17_2019_ajo/bark16smothur_jnigra17_microbiome/bark16staxonomy_jnigra17_fa19_ajo.cons.taxonomy",header=TRUE,sep="\t",fill=TRUE,strip.white=TRUE,row.names=1)

#I then create a new object that contains the group names from the metadata file. 
ds.names<-ds.met$Group

Rarefaction and Singleton Removal

To evaluate coverage and differences in sequencing depth between samples, I generate rarefaction curves with sequencing depth on the x-axis and number of OTUs on the y-axis. These curves can be used to evaluate where to set the rarefaction cut-off to reduce sampling biases when calculating alpha-diversity measures such as diversity and observed richness. Additionally, to reduce the effects of rare taxa on ordinations, I remove singletons here as well. ## Rarefaction Curves Below I generate the rarefaction curves using the rarecurve function from vegan. This will take sometime to run, particularly on larger OTU tables. The output is then subset so that individual samples can be plotted in ggplot.

#I then create an object that spans the range of the OTU richness. This will be used to plot the vertical line in my ggplot rarefaction curve. 
f<-(0:2000)

#I then create objects containing the x and y coordinates values for each sample. This is probably not the most elegant way of creating rarefaction curves in ggplot, but I don't know a better way at the moment. 
sequenceds47<-attr(rarebacdscurve[[47]],which="Subsample")
sequenceds46<-attr(rarebacdscurve[[46]],which="Subsample")
sequenceds45<-attr(rarebacdscurve[[45]],which="Subsample")
sequenceds44<-attr(rarebacdscurve[[44]],which="Subsample")
sequenceds43<-attr(rarebacdscurve[[43]],which="Subsample") 
sequenceds42<-attr(rarebacdscurve[[42]],which="Subsample")
sequenceds41<-attr(rarebacdscurve[[41]],which="Subsample")
sequenceds40<-attr(rarebacdscurve[[40]],which="Subsample")
sequenceds39<-attr(rarebacdscurve[[39]],which="Subsample")
sequenceds38<-attr(rarebacdscurve[[38]],which="Subsample")
sequenceds37<-attr(rarebacdscurve[[37]],which="Subsample")
sequenceds36<-attr(rarebacdscurve[[36]],which="Subsample")
sequenceds35<-attr(rarebacdscurve[[35]],which="Subsample")
sequenceds34<-attr(rarebacdscurve[[34]],which="Subsample")
sequenceds33<-attr(rarebacdscurve[[33]],which="Subsample")
sequenceds32<-attr(rarebacdscurve[[32]],which="Subsample")
sequenceds31<-attr(rarebacdscurve[[31]],which="Subsample")
sequenceds30<-attr(rarebacdscurve[[30]],which="Subsample")
sequenceds29<-attr(rarebacdscurve[[29]],which="Subsample")
sequenceds28<-attr(rarebacdscurve[[28]],which="Subsample")
sequenceds27<-attr(rarebacdscurve[[27]],which="Subsample")
sequenceds26<-attr(rarebacdscurve[[26]],which="Subsample")
sequenceds25<-attr(rarebacdscurve[[25]],which="Subsample")
sequenceds24<-attr(rarebacdscurve[[24]],which="Subsample")
sequenceds23<-attr(rarebacdscurve[[23]],which="Subsample")
sequenceds22<-attr(rarebacdscurve[[22]],which="Subsample")
sequenceds21<-attr(rarebacdscurve[[21]],which="Subsample")
sequenceds20<-attr(rarebacdscurve[[20]],which="Subsample")
sequenceds19<-attr(rarebacdscurve[[19]],which="Subsample")
sequenceds18<-attr(rarebacdscurve[[18]],which="Subsample")
sequenceds17<-attr(rarebacdscurve[[17]],which="Subsample")
sequenceds16<-attr(rarebacdscurve[[16]],which="Subsample")
sequenceds15<-attr(rarebacdscurve[[15]],which="Subsample")
sequenceds14<-attr(rarebacdscurve[[14]],which="Subsample")
sequenceds13<-attr(rarebacdscurve[[13]],which="Subsample")
sequenceds12<-attr(rarebacdscurve[[12]],which="Subsample")
sequenceds11<-attr(rarebacdscurve[[11]],which="Subsample")
sequenceds10<-attr(rarebacdscurve[[10]],which="Subsample")
sequenceds9<-attr(rarebacdscurve[[9]],which="Subsample")
sequenceds8<-attr(rarebacdscurve[[8]],which="Subsample")
sequenceds7<-attr(rarebacdscurve[[7]],which="Subsample")
sequenceds6<-attr(rarebacdscurve[[6]],which="Subsample")
sequenceds5<-attr(rarebacdscurve[[5]],which="Subsample")
sequenceds4<-attr(rarebacdscurve[[4]],which="Subsample")
sequenceds3<-attr(rarebacdscurve[[3]],which="Subsample")
sequenceds2<-attr(rarebacdscurve[[2]],which="Subsample")
sequenceds1<-attr(rarebacdscurve[[1]],which="Subsample")

#Here I plot the rarefaction curves using ggplot. 
library(ggplot2)
library(ggpubr)
dsrarecurv<-ggplot()+
  geom_line(aes(x=sequenceds47,y=rarebacdscurve[[47]]), size=1)+
  geom_line(aes(x=sequenceds46,y=rarebacdscurve[[46]]),size=1)+
  geom_line(aes(x=sequenceds45,y=rarebacdscurve[[45]]),size=1 )+
  geom_line(aes(x=sequenceds44,y=rarebacdscurve[[44]]),size=1)+
  geom_line(aes(x=sequenceds43,y=rarebacdscurve[[43]]),size=1)+
  geom_line(aes(x=sequenceds42,y=rarebacdscurve[[42]]),size=1)+
  geom_line(aes(x=sequenceds41,y=rarebacdscurve[[41]]),size=1)+
  geom_line(aes(x=sequenceds40,y=rarebacdscurve[[40]]),size=1)+
  geom_line(aes(x=sequenceds39,y=rarebacdscurve[[39]]), size=1)+  
  geom_line(aes(x=sequenceds38,y=rarebacdscurve[[38]]), size=1)+  
  geom_line(aes(x=sequenceds37,y=rarebacdscurve[[37]]), size=1)+
  geom_line(aes(x=sequenceds36,y=rarebacdscurve[[36]]),size=1)+
  geom_line(aes(x=sequenceds35,y=rarebacdscurve[[35]]),size=1 )+
  geom_line(aes(x=sequenceds34,y=rarebacdscurve[[34]]),size=1)+
  geom_line(aes(x=sequenceds33,y=rarebacdscurve[[33]]),size=1)+
  geom_line(aes(x=sequenceds32,y=rarebacdscurve[[32]]),size=1)+
  geom_line(aes(x=sequenceds31,y=rarebacdscurve[[31]]),size=1)+
  geom_line(aes(x=sequenceds30,y=rarebacdscurve[[30]]),size=1)+
  geom_line(aes(x=sequenceds29,y=rarebacdscurve[[29]]), size=1)+  
  geom_line(aes(x=sequenceds28,y=rarebacdscurve[[28]]), size=1)+
  geom_line(aes(x=sequenceds27,y=rarebacdscurve[[27]]), size=1)+
  geom_line(aes(x=sequenceds26,y=rarebacdscurve[[26]]),size=1)+
  geom_line(aes(x=sequenceds25,y=rarebacdscurve[[25]]),size=1 )+
  geom_line(aes(x=sequenceds24,y=rarebacdscurve[[24]]),size=1)+
  geom_line(aes(x=sequenceds23,y=rarebacdscurve[[23]]),size=1)+
  geom_line(aes(x=sequenceds22,y=rarebacdscurve[[22]]),size=1)+
  geom_line(aes(x=sequenceds21,y=rarebacdscurve[[21]]),size=1)+
  geom_line(aes(x=sequenceds20,y=rarebacdscurve[[20]]),size=1)+
  geom_line(aes(x=sequenceds19,y=rarebacdscurve[[19]]), size=1)+
  geom_line(aes(x=sequenceds18,y=rarebacdscurve[[18]]),size=1)+
  geom_line(aes(x=sequenceds17,y=rarebacdscurve[[17]]),size=1)+
  geom_line(aes(x=sequenceds16,y=rarebacdscurve[[16]]),size=1)+
  geom_line(aes(x=sequenceds15,y=rarebacdscurve[[15]]),size=1)+
  geom_line(aes(x=sequenceds14,y=rarebacdscurve[[14]]), size=1)+
  geom_line(aes(x=sequenceds13,y=rarebacdscurve[[13]]), size=1)+
  geom_line(aes(x=sequenceds12,y=rarebacdscurve[[12]]),size=1)+
  geom_line(aes(x=sequenceds11,y=rarebacdscurve[[11]]),size=1 )+
  geom_line(aes(x=sequenceds10,y=rarebacdscurve[[10]]),size=1)+
  geom_line(aes(x=sequenceds9,y=rarebacdscurve[[9]]),size=1)+
  geom_line(aes(x=sequenceds8,y=rarebacdscurve[[8]]),size=1)+
  geom_line(aes(x=sequenceds7,y=rarebacdscurve[[7]]),size=1)+
  geom_line(aes(x=sequenceds6,y=rarebacdscurve[[6]]),size=1)+
  geom_line(aes(x=sequenceds5,y=rarebacdscurve[[5]]), size=1)+
  geom_line(aes(x=sequenceds4,y=rarebacdscurve[[4]]),size=1)+
  geom_line(aes(x=sequenceds3,y=rarebacdscurve[[3]]),size=1)+
  geom_line(aes(x=sequenceds2,y=rarebacdscurve[[2]]),size=1)+
  geom_line(aes(x=sequenceds1,y=rarebacdscurve[[1]]),size=1)+
  geom_path(aes(x=2000,y=f), linetype=5)+
  #scale_color_manual(values=c("#00CCCC","#336699","#336699","#336699","#663300","#663300","#00CCCC","#CC9933","#CC9933","#CC9933","#0000FF","#0000FF","#006600","#006600"))+
  theme(panel.grid.major=element_blank(),
       panel.grid.minor=element_blank(),
       panel.background = element_blank(),
  panel.border=element_rect(color="black", size=1, fill=NA))+
 theme(axis.title.x=element_text(size=14, face="bold"))+
  theme(axis.title.y=element_text(size=14, face="bold"))+
  theme(axis.text.x=element_text(size=12, face="bold"))+
  theme(axis.text.y=element_text(size=12, face="bold"))+ 
labs(x=("Sequencing Depth"), y=expression(bold(paste("OTU Richness", sep=""))))
dsrarecurv

Sample Rarefaction

Below I rarefy my samples using the rrarefy function from vegan. Rarefaction cut-offs are determined by reviewing the rarefaction curves and identifying the lowest number of sequences that minimizes sample loss but maximizing sampling depth across samples. Based on the above rarefaction curve, a cut-off of 2,000 sequences per samples results in the loss of seven samples (three from TN and four from WA).

##     TN132CDSBAC     TN272BDSBAC WABNL20130DSBAC     TN130ADSBAC  WARN10272DSBAC 
##             376             691            1278            1319            1378 
## WABNL17272DSBAC    WARN255DSBAC      TN55CDSBAC   INMCB755DSBAC   TNWTMB19DSBAC 
##            1708            1962            2088            2192            2332 
##   TNWTMB21DSBAC  WABNL1955DSBAC     TN272CDSBAC     TN272ADSBAC INMCB26130DSBAC 
##            2389            2471            2917            3016            3319 
##      TN55BDSBAC   TNWTMB20DSBAC  WABNL21WTDSBAC     TN132BDSBAC      TN55ADSBAC 
##            3349            4172            4334            4453            4710 
##  INWTMCB33DSBAC INMCB16132DSBAC      TNLS1DSBAC     TN130BDSBAC   INMCB655DSBAC 
##            5077            5141            5186            5509            5655 
##     TN132ADSBAC WABNL18272DSBAC      TNLS2DSBAC   WARN4130DSBAC   INMCB855DSBAC 
##            5728            6023            6224            6615            6864 
## INMCB27132DSBAC   WARN9132DSBAC INMCB10272DSBAC     TN130CDSBAC  INMCB28WTDSBAC 
##            6973            7126            7371            8097            8375 
##  INMCB9130DSBAC  INMCB2130DSBAC INMCB11272DSBAC  INWTMCB29DSBAC INMCB17132DSBAC 
##            8603            8837            9246            9997           10608 
##  WABNL22WTDSBAC    WARN155DSBAC      TNLS3DSBAC  WABNL23WTDSBAC   WARN8132DSBAC 
##           11832           12240           16021           16458           16854 
##   WARN7132DSBAC INMCB24272DSBAC 
##           22993           43844

Data Subsetting

Below I remove unrarefied samples (those less than minimum sequencing depth) as well as singleton OTUs (those with only one sequence following rarefaction).

#Because the rrarefy function doesn't automatically remove samples from the dataframe I subset the rarefied dataframe to remove unrarefied samples (those less than the minimum sequencing depth). 
ds.rareotu<-as.data.frame(subset(ds.rareotu,rowSums(ds.rareotu)>=2000))

#First I create an object that contains the sample names of the samples that passed through the rarefaction step. This will be used to filter the metadata file. 
ds.samples<-row.names(ds.rareotu)
ds.met.rare<-subset(ds.met,ds.met$Group%in% ds.samples)

#I am now removing OTUs present in the OTU table that were only present in samples removed following rarefaction. These OTUs will have column sums of 0. 
ds.rareotu<-ds.rareotu[,colSums(ds.rareotu)>0]

#The following line is for saving the rarefied OTU table. This file is later reloaded to be used in downstream analyses. This is done to maintain consistency of results because rarefying generates slightly different results each time. 
#write.table(ds.rareotu, "~/Documents/microbiome2017/16s_bark_otutable_rarefied_fa19_ajo.shared",sep="\t", row.names=TRUE)

#I am now removing singleton OTUs (OTUs with only 1 representative sequence following rarefaction). This is to limit the effects of rare species on our distance matrices used during ordination.  
ds.rareotu.nosingletons<-ds.rareotu[,colSums(ds.rareotu)>1]


#The following line is for saving the rarefied OTU table with singletons removed. This file is later reloaded to be used in downstream analyses. This is done to maintain consistency of results because rarefying generates slightly different results each time. 
#write.table(ds.rareotu.nosingletons, "~/Documents/microbiome2017/16s_bark_otutable_rarefied_nosingleton_fa19_ajo.shared",sep="\t", row.names=TRUE)

Relative Abundance Charts

To visualize differences in the relative abundance of different taxa, I reorganize the OTU table and taxonomy strings to create relative abundance stacked bar charts. Below, stacked bar charts are made for phylum, class, and order levels.

Taxonomy Subsetting

To begin, the taxonomy file is subsetted to include only taxonomy information for OTUs present in the rarefied OTU table with singletons removed. The OTU table is first transposed so that OTU IDs are the row names and sample names are the column names. Then the taxonomic assignments are added to the OTU table. Taxonomic strings are then split by semi-colon into individual columns by Phylum, Class, Order, Family, and Genus (SILVA does not go to species level).

#Now that samples have been rarefied I can perform alpha and beta diversity analyses on my data sets. To begin I first reformat my OTU table so that I can do some filtering and add taxonomy information. 
ds.rareotu.nosingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/barks16sstatistics_jnigra17_2019_ajo/bark16smothur_jnigra17_microbiome/bark16sotutable.rarefied.nosingleton_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

#I now transpose my OTU table so that I can begin to add in taxonomy information. 
t.ds.rareotu.nosingles<-as.data.frame(t(ds.rareotu.nosingles))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.ds.rareotu.nosingleslabs<-labels(t.ds.rareotu.nosingles)
ds.taxrare<-subset(taxonomy.ds, rownames(taxonomy.ds) %in% t.ds.rareotu.nosingleslabs[[1]])

#Now I am subsetting the new taxonomy file to include only the taxonomy information stored in column 2. 
ds.taxrareinfo<-(ds.taxrare[,2])

#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)

#The SILVA database only provides classifications up to genus level, new columns are created only up to Genus level.
ds.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")

#Below I create a new data frame with a column containing taxonomy information. 
t.ds.rareotutabtax<-data.frame(taxonomy=ds.taxrareinfo,t.ds.rareotu.nosingles)

#I then separate the taxonomy strings into new columns labeled with the taxonomy labels saved in the taxonomylabs object. Strings are being separated based on the presence of semi-colons. 
t.ds.rareotutabtaxsep<-separate(t.ds.rareotutabtax,into=ds.taxonomylabs,col=taxonomy,sep=";")
head(t.ds.rareotutabtaxsep)
##               kingdom              phylum                    class
## Otu0001 Bacteria(100) Proteobacteria(100) Gammaproteobacteria(100)
## Otu0002 Bacteria(100) Proteobacteria(100) Alphaproteobacteria(100)
## Otu0003 Bacteria(100) Proteobacteria(100) Alphaproteobacteria(100)
## Otu0004 Bacteria(100) Proteobacteria(100) Gammaproteobacteria(100)
## Otu0005 Bacteria(100)  Bacteroidetes(100)         Bacteroidia(100)
## Otu0006 Bacteria(100) Proteobacteria(100) Gammaproteobacteria(100)
##                              order                   family
## Otu0001     Enterobacteriales(100)  Enterobacteriaceae(100)
## Otu0002      Sphingomonadales(100)   Sphingomonadaceae(100)
## Otu0003      Sphingomonadales(100)   Sphingomonadaceae(100)
## Otu0004 Betaproteobacteriales(100)    Burkholderiaceae(100)
## Otu0005    Sphingobacteriales(100) Sphingobacteriaceae(100)
## Otu0006 Betaproteobacteriales(100)    Burkholderiaceae(100)
##                                        genus INMCB10272DSBAC INMCB11272DSBAC
## Otu0001 Enterobacteriaceae_unclassified(100)               0               0
## Otu0002                     Sphingomonas(95)              64              97
## Otu0003                    Sphingomonas(100)               0               0
## Otu0004    Burkholderiaceae_unclassified(96)               0               0
## Otu0005 Sphingobacteriaceae_unclassified(79)               0               3
## Otu0006                        Massilia(100)               4              11
##         INMCB16132DSBAC INMCB17132DSBAC INMCB2130DSBAC INMCB24272DSBAC
## Otu0001               5              28             20               0
## Otu0002              34              24             59              64
## Otu0003               0               0              1               0
## Otu0004               1               3              1               0
## Otu0005               4               1              5               0
## Otu0006               8               2             11               1
##         INMCB26130DSBAC INMCB27132DSBAC INMCB28WTDSBAC INMCB655DSBAC
## Otu0001               1               0             83             2
## Otu0002              54             107             74            83
## Otu0003               0               0              0             0
## Otu0004               0               0              0             0
## Otu0005              11              17              9            11
## Otu0006              17               7             13            30
##         INMCB755DSBAC INMCB855DSBAC INMCB9130DSBAC INWTMCB29DSBAC
## Otu0001            49             0              0             11
## Otu0002            59            77             51             84
## Otu0003             0             0              0              1
## Otu0004             4             1              3              3
## Otu0005             1             8              5              1
## Otu0006            18            14              1              5
##         INWTMCB33DSBAC TN130BDSBAC TN130CDSBAC TN132ADSBAC TN132BDSBAC
## Otu0001              1          10          23           2           3
## Otu0002            104         119         141         106         138
## Otu0003              0           4           0           0           2
## Otu0004              0           0           4           0           1
## Otu0005              4          10          46           3          17
## Otu0006              8           7          45           3           0
##         TN272ADSBAC TN272CDSBAC TN55ADSBAC TN55BDSBAC TN55CDSBAC TNLS1DSBAC
## Otu0001           2          35          0          4          0        329
## Otu0002          65         108         74         44         80         63
## Otu0003           1           3          2          0          0          0
## Otu0004          13           0          0          0          0          0
## Otu0005           5          22         17          8         40          1
## Otu0006           5           4          6          1          2          0
##         TNLS2DSBAC TNLS3DSBAC TNWTMB19DSBAC TNWTMB20DSBAC TNWTMB21DSBAC
## Otu0001        236       1013           166             2             4
## Otu0002         73         20            93            88            86
## Otu0003          0          0             5             0             0
## Otu0004          2          0             0             0             0
## Otu0005         11          5            14             6             3
## Otu0006          2          0            25             6             8
##         WABNL18272DSBAC WABNL1955DSBAC WABNL21WTDSBAC WABNL22WTDSBAC
## Otu0001             556             79              5              1
## Otu0002              52             78             71             94
## Otu0003             163             60            162            152
## Otu0004              57             38            120            133
## Otu0005             165            137            141            227
## Otu0006              75            145            231            168
##         WABNL23WTDSBAC WARN155DSBAC WARN4130DSBAC WARN7132DSBAC WARN8132DSBAC
## Otu0001            190          160           121            39            19
## Otu0002             52           44            22            43            32
## Otu0003            327          133            57           133            72
## Otu0004            143          139           136           175           123
## Otu0005             95           78            62           102            57
## Otu0006            175           85            83            96            29
##         WARN9132DSBAC
## Otu0001            51
## Otu0002            40
## Otu0003           152
## Otu0004            90
## Otu0005            59
## Otu0006            98

Relative Abundance OTU Table

I then create a relative abundance OTU table. This table contains taxonomic information and can be searched to look at specific OTUs more easily.

#Now that samples have been rarefied I can perform alpha and beta diversity analyses on my data sets. To begin I first reformat my OTU table so that I can do some filtering and add taxonomy information. 
ds.rareotu.nosingles<-read.table("/Volumes/AaronOnufrakMac/researchprojects_2019.2022_ajo/j.nigra_microbiome_2017_ajo_gmw/barks16sstatistics_jnigra17_2019_ajo/bark16smothur_jnigra17_microbiome/bark16sotutable.rarefied.nosingleton_jnigra17_fa19_ajo.shared", header=TRUE, sep="\t")

ds.rareotu.nosingles.table<-decostand(ds.rareotu.nosingles,method="total")


#I now transpose my OTU table so that I can begin to add in taxonomy information. 
t.ds.rareotu.nosingles.table<-as.data.frame(t(ds.rareotu.nosingles.table))

#The below function is asking specifically for the OTU IDs. These are the row names. I will use these to subset the taxonomy file. 
t.ds.rareotu.nosingleslabs<-labels(t.ds.rareotu.nosingles.table)
ds.taxrare<-subset(taxonomy.ds, rownames(taxonomy.ds) %in% t.ds.rareotu.nosingleslabs[[1]])

#Now I am subsetting the new taxonomy file to include only the taxonomy information stored in column 2. 
ds.taxrareinfo<-(ds.taxrare[,2])

#Below I am using the separate function from tidyr to split the taxonomy strings into columns by semi-colon so that I can rename the OTUs to give them more meaning for downstream analyses.
library(tidyr)

#The SILVA database only provides classifications up to genus level, new columns are created only up to Genus level.
ds.taxonomylabs<-c("kingdom","phylum","class","order","family","genus")

#Below I create a new data frame with a column containing taxonomy information. 
t.ds.rareotutabtax.table<-data.frame(taxonomy=ds.taxrareinfo,t.ds.rareotu.nosingles.table)

#I then separate the taxonomy strings into new columns labeled with the taxonomy labels saved in the taxonomylabs object. Strings are being separated based on the presence of semi-colons. 
t.ds.rareotutabtaxsep.table<-separate(t.ds.rareotutabtax.table,into=ds.taxonomylabs,col=taxonomy,sep=";")
datatable(t.ds.rareotutabtaxsep.table, fillContainer = T, height="500dpx")

Phylum Relative Abudance Charts

Below I generate a phylum relative abundance plot. I first extract the phylum information. I then clean the taxa names to remove assignment confidence values. OTU raw abundance is then summed by phylum and state, phyla that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function and ggplot.

#Below I create a data frame that consists of phylum assignment and the rarefied OTU table. 
t.ds.raretabphy<-data.frame(phylum=t.ds.rareotutabtaxsep$phylum,t.ds.rareotu.nosingles)

#I then remove the assignment confidence values contained in parentheses found in each taxonomic string using the sub function.
t.ds.raretabphy$phylum<-sub("\\(.*)","",x=t.ds.raretabphy$phylum)

#The following code is for the generation of phylum relative abundance stacked bar charts. I first sum each OTU by the phylum it belongs to. 
library(plyr)
t.ds.rareotutabphy<-as.data.frame(ddply(t.ds.raretabphy, .(phylum),colwise(sum)))
rownames(t.ds.rareotutabphy)<-make.names(t.ds.rareotutabphy$phylum)
t.ds.rareotutabphy<-t.ds.rareotutabphy[,2:41]

#I then transpose the data frame so that sample names are now row names and column names are OTU names. 
ds.rareotutabphy<-as.data.frame(t(t.ds.rareotutabphy))
ds.rareotutabphy.state<-data.frame(state=ds.met.rare$State,ds.rareotutabphy)

#I then sum each OTU by state and subset to exclude sample names. This is because the following functions only work on matrices containing numeric data.  
ds.rareotutabphy.statesum<-as.data.frame(ddply(ds.rareotutabphy.state, .(state),colwise(sum)))
ds.phylumcols<-ds.rareotutabphy.statesum[,2:20]

#I am now creating an other category. Other includes those OTUs belonging to a phylum that comprises less than 1% of the total community. This new object is referred to as phyothers.
ds.phyoth<-ds.phylumcols[,colSums(ds.phylumcols)/sum(ds.phylumcols)<=0.01]
ds.phyothers<-rowSums(ds.phyoth)

#I then create an object that contains all of the phyla that comprise more than 1% of the total community. 
ds.phyreg<-ds.phylumcols[,colSums(ds.phylumcols)/sum(ds.phylumcols)>0.01]

#I then create a new dataframe containing the state information, the other column, and the remaining phyla.
ds.phytot2<-data.frame(state=ds.rareotutabphy.statesum$state,ds.phyreg,Other=ds.phyothers)

#These values are then converted to relative abundance using the decostand function from vegan. 
library(vegan)
ds.phyrelabund<-decostand(ds.phytot2[,2:10],method="total")
ds.phyrelabund<-data.frame(state=ds.rareotutabphy.statesum$state,ds.phyrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format. 
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
ds.phyrelabundmelt<-melt(ds.phyrelabund, id.vars="state", variable.name="Phylum")
ds.colors.n.phylum <- length(unique(ds.phyrelabundmelt[,'Phylum']))

#Below I generate the relative abundance bar charts in ggplot
ds.phyrel<-ggplot(ds.phyrelabundmelt, aes(x=state, y=value, fill=Phylum))+
  geom_bar(stat="identity", show.legend=TRUE, color="black")+
  scale_fill_manual(values=colorRampPalette(brewer.pal(9, 'Set1'))(ds.colors.n.phylum)) +
  xlab("State") +
  ylab("Relative Abundance") +
  scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
geom_text(data=NULL,aes(x=0.5, y=1.05,label="16S Caulosphere"),hjust=0,colour="black")
library(plotly)
ggplotly(ds.phyrel,tooltip=c("Phylum"))

Class Relative Abundance Chart

Below I generate a class relative abundance plot. I first extract the class information from the OTU table that contains the taxonomy information. I then clean the taxa names to remove assignment confidence values and other non-taxonomic associated string portions from the class assignment. I then create an unknown group which consists of uncultured bacteria and bacteria unclassified at the class level. OTU raw abundance is then summed by class and state, classes that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function from ggplot.

#The following chunk creates a relative abundance bar chart at the class level. 
library(plyr)
t.ds.raretabcls<-data.frame(class=t.ds.rareotutabtaxsep$class,t.ds.rareotu.nosingles)

#What I am doing below is cleaning up the class names. These include assignment confidence levels, _or, and _.
t.ds.raretabcls$class<-sub("\\(.*)","",x=t.ds.raretabcls$class)
t.ds.raretabcls$class<-sub("_or","",x=t.ds.raretabcls$class)
t.ds.raretabcls$class<-sub("_"," ",x=t.ds.raretabcls$class)

#I then sum the OTUs by class and make the row names the class name. 
t.ds.rareotutabcls<-as.data.frame(ddply(t.ds.raretabcls, .(class),colwise(sum)))
rownames(t.ds.rareotutabcls)<-make.names(t.ds.rareotutabcls$class)
t.ds.rareotutabcls<-t.ds.rareotutabcls[,2:41]

#I then transpose the data frame and sum each class by state. 
ds.rareotutabcls<-as.data.frame(t(t.ds.rareotutabcls))
ds.rareotutabcls.state<-data.frame(state=ds.met.rare$State,ds.rareotutabcls)
ds.rareotutabcls.statesum<-as.data.frame(ddply(ds.rareotutabcls.state, .(state),colwise(sum)))

#At the class level, SILVA may classify things as unknown class, phylum.unclassified, or uncultured. These don't provide a lot of information and thus are grouped into a knew group called unknown. 
ds.clsunknown<-ds.rareotutabcls.statesum[,grep(".unclassified|Unknown.class|uncultured",names(ds.rareotutabcls.statesum))]
ds.clsunknownsum<-rowSums(ds.clsunknown)

#I then use the same grep function as above to pull out the class with meaningful information. By setting invert=TRUE I select items lacking the strings in the grep command. 
ds.clstot<-ds.rareotutabcls.statesum[,grep(".unclassified|Unknown.class|uncultured",names(ds.rareotutabcls.statesum), invert=TRUE)]

#I then subset the data so that only numerical data is present and creating a new class of objects, other, that includes OTUs from classes that represent 1% or less of the total community. 
ds.clscols<-ds.clstot[,2:44]
ds.clsoth<-ds.clscols[,colSums(ds.clscols)/sum(ds.clscols)<=0.01]
ds.clsothers<-rowSums(ds.clsoth)
ds.clsreg<-ds.clscols[,colSums(ds.clscols)/sum(ds.clscols)>0.01]

#I then create a new data frame that contains all of the class data and determine relative abundance using the decostand function from vegan. 
ds.clstot2<-data.frame(state=ds.clstot$state,ds.clsreg,Other=ds.clsothers,Unclassified=ds.clsunknownsum)
library(vegan)
ds.clsrelabund<-decostand(ds.clstot2[,2:12],method="total")
ds.clsrelabund<-data.frame(state=ds.rareotutabcls.statesum$state,ds.clsrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
ds.clsrelabundmelt<-melt(ds.clsrelabund, id.vars="state", variable.name="class")

ds.cls.colorCount<- length(unique(ds.clsrelabundmelt[,'class']))
getPalette=colorRampPalette(brewer.pal(9,"Set1"))

#Below I generate the relative abundance bar charts in ggplot
ds.clsrel<-ggplot(ds.clsrelabundmelt, aes(x=state, y=value, fill=class))+
  geom_bar(stat="identity",show.legend=TRUE,color="black")+
  scale_fill_manual(values=getPalette(ds.cls.colorCount))+
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
  scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.51, y=1.05,label="16S Caulosphere"),hjust=0,colour="black")

library(plotly)
ggplotly(ds.clsrel,tooltip=c("class"))

Order Relative Abundance Chart

Below I generate an order relative abundance plot. I first extract the order information from the OTU table that contains the taxonomy information. I then clean the taxa names to remove assignment confidence values and other non-taxonomic associated string portions from the order assignment. I then create an unknown group which consists of uncultured bacteria and bacteria unclassified at the order level. OTU raw abundance is then summed by order and state, orders that represent less than 1% of the total community are grouped into other, and relative abundance is calculated. Stacked bar charts are made using the melt function from ggplot.

#The following chunk creates a relative abundance bar chart at the order level.

#What I am doing below is cleaning up the order names. These include assignment confidence levels, _or, and _.
library(plyr)
t.ds.raretabord<-data.frame(order=t.ds.rareotutabtaxsep$order,t.ds.rareotu.nosingles)
t.ds.raretabord$order<-sub("\\(.*)","",x=t.ds.raretabord$order)
t.ds.raretabord$order<-sub("_or","",x=t.ds.raretabord$order)
t.ds.raretabord$order<-sub("_"," ",x=t.ds.raretabord$order)

#I then sum the OTUs by class and make the row names the order name. 
t.ds.rareotutabord<-as.data.frame(ddply(t.ds.raretabord, .(order),colwise(sum)))
rownames(t.ds.rareotutabord)<-make.names(t.ds.rareotutabord$order)
t.ds.rareotutabord<-t.ds.rareotutabord[,2:41]

#I then transpose the data frame and sum each order by state. 
ds.rareotutabord<-as.data.frame(t(t.ds.rareotutabord))
ds.rareotutabord.state<-data.frame(state=ds.met.rare$State,ds.rareotutabord)
ds.rareotutabord.statesum<-as.data.frame(ddply(ds.rareotutabord.state, .(state),colwise(sum)))

#At the class level, SILVA may classify things as unknown class, phylum.unclassified, or uncultured. These don't provide a lot of information and thus are grouped into a knew group called unknown. 
ds.ordunknown<-ds.rareotutabord.statesum[,grep(".unclassified|Unknown.Order|uncultured",names(ds.rareotutabord.statesum))]

#I then use the same grep function as above to pull out the order with meaningful information. By setting invert=TRUE I select items lacking the strings in the grep command. 
ds.ordunknownsum<-rowSums(ds.ordunknown)
ds.ordtot<-ds.rareotutabord.statesum[,grep(".unclassified|Unknown.Order|uncultured",names(ds.rareotutabord.statesum), invert=TRUE)]

#I then subset the data so that only numerical data is present and creating a new class of objects, other, that includes OTUs from order that represent 1% or less of the total community. 
ds.ordcols<-ds.ordtot[,2:92]
ds.ordoth<-ds.ordcols[,colSums(ds.ordcols)/sum(ds.ordcols)<=0.01]
ds.ordothers<-rowSums(ds.ordoth)
ds.ordreg<-ds.ordcols[,colSums(ds.ordcols)/sum(ds.ordcols)>0.01]

#I then create a new data frame that contains all of the order data and determine relative abundance using the decostand function from vegan. 
library(vegan)
ds.ordtot2<-data.frame(state=ds.ordtot$state,ds.ordreg,Other=ds.ordothers, Unclassified=ds.ordunknownsum)
ds.ordrelabund<-decostand(ds.ordtot2[,2:25],method="total")
ds.ordrelabund<-data.frame(state=ds.rareotutabord.statesum$state,ds.ordrelabund)

#Below I use the melt function from the data.table package to reformat the data into stacked bar graph format.
library(data.table)
library(ggplot2)
library(ggpubr)
library(RColorBrewer)
ds.ordrelabundmelt<-melt(ds.ordrelabund, id.vars="state", variable.name="order")

ds.ord.colorCount<- length(unique(ds.ordrelabundmelt[,'order']))
getPalette=colorRampPalette(brewer.pal(9,"Set1"))

#Below I generate the relative abundance bar charts in ggplot
ds.ordrel<-ggplot(ds.ordrelabundmelt, aes(x=state, y=value, fill=order))+
  geom_bar(stat="identity",show.legend=TRUE,color="black")+
  scale_fill_manual(values=getPalette(ds.ord.colorCount))+
  xlab("State") +
  ylab("Relative Abundance") +
  theme(panel.border = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), axis.text.x = element_text(size = 14), axis.ticks.x = element_blank(), axis.text.y = element_text(size = 14), axis.line.y.left = element_line(), axis.title.y = element_text(size = 14))+
  scale_y_continuous(breaks=c(0,0.25,0.50,0.75,1),limits=c(0,1.05))+
  geom_text(data=NULL,aes(x=0.51, y=1.05,label="16S Caulosphere"),hjust=0,colour="black")
library(plotly)
ggplotly(ds.ordrel,tooltip=c("order"))

Taxonomy Filtering and Merging

Taxonomy is first filtered to include only those OTUs present in the rarefied OTU table with no singletons. Then the OTU table is transposed so that OTU ID is the row name and sample ID is the column name. Taxonomy strings are then added to the OTU table and strings are split by semi-colon.

Principal Coordinate Analysis

To visually assess differences in the community composition of drill shaving bacterial communities by clone and state, I perform a principal coordinate analysis (PCoA).

Relative Abundance Calculation

Prior to conducting a PCoA I first convert my OTU table from raw OTU abundances to relative abundance.

Principal Coordinate Analysis

I then perform the PCoA using the pcoa function from the ape package. Prior to performing the PCoA, a distance matrix needs to be calculated for the OTU table. In this case, I use the bray-curtis method.

Plotting of Principal Coordinate Analysis

Then to plot the PCoA, I use the ggplot function. To use ggplot I need to first pull out the site scores for each sample and generate ellipses.

#We need to first subset our metadata table to include only those samples that passed rarefaction.  
ds.met.rare<-subset(ds.met,ds.met$Group%in% ds.samples)

#Then we extract PCoA site scores (these will be the x and y coordinates for each sample)
ds.bac.pcoavec<-as.data.frame(ds.bac.pcoa$vectors)
ds.bac.pcoasitescores<-data.frame(PC1=ds.bac.pcoavec$Axis.1, PC2=ds.bac.pcoavec$Axis.2)

#Create a new dataframe that includes the site scores from above with metadata from your study. 
ds.bac.pcoagraph<-data.frame(ds.bac.pcoasitescores,PC1=ds.bac.pcoasitescores$PC1, PC2=ds.bac.pcoasitescores$PC2, State=ds.met.rare$State, Clone=ds.met.rare$Clone,group=ds.met.rare$Group)

#This is where you make confidence ellipses. I don't know what all the code means. Just know where to plug in my objects. 
ds.bac.pcoaellipse<-ordiellipse(ds.bac.pcoasitescores,ds.bac.pcoagraph$State, display="sites", kind="sd", draw="none")

df_ell <- data.frame()
for(g in levels(ds.bac.pcoagraph$State)){
df_ell <- rbind(df_ell, cbind(as.data.frame(with(ds.bac.pcoagraph[ds.bac.pcoagraph$State==g,],                                                vegan:::veganCovEllipse(ds.bac.pcoaellipse[[g]]$cov,ds.bac.pcoaellipse[[g]]$center,ds.bac.pcoaellipse[[g]]$scale))) ,State=g))}

#Plot PCoA in ggplot
library(ggplot2)
ds.pcoa<-ggplot(ds.bac.pcoagraph, aes(PC1,PC2))+
  #geom_text(aes(label=group))+
  geom_point(aes(shape=Clone,colour=State), size=3.5)+
  geom_path(data=df_ell, aes(x=PC1, y=PC2, colour=State),size=1, linetype=2)+
 theme(axis.title.x=element_text(size=14, face="bold"))+
  theme(axis.title.y=element_text(size=14, face="bold"))+
  theme(axis.text.x=element_text(size=12, face="bold"))+
  theme(axis.text.y=element_text(size=12, face="bold"))+ 
  scale_color_manual(values=c("#006600","#3399FF","#FF9900"))+
  geom_text(aes(x=-0.60, y=0.60, label="16S Caulosphere"),hjust=0,colour="black")+
  scale_shape_manual(values=c(16,10,8,7,0,2))+
  scale_x_continuous(breaks=c(-0.60,-0.30,0,0.30,0.60),limits=c(-0.60,0.60))+
  scale_y_continuous(breaks=c(-0.60,-0.30,0,0.30,0.60),limits=c(-.60,.60))+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
    panel.border=element_rect(colour="black", size=1, fill=NA))+
  theme(panel.grid.major=element_blank(),
       panel.grid.minor=element_blank(),
       panel.background = element_blank(),
  panel.border=element_rect(color="black", size=1, fill=NA))
ds.pcoa

PERMANOVA

To formally test how well state and clone explain differences in drill shaving bacterial community composition, I perform a PERMANOVA using the adonis function from vegan

Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
State 2 4.1756717 2.0878359 12.705066 0.3980093 0.0001000
Clone 4 0.8927993 0.2231998 1.358234 0.0850983 0.0954905
Residuals 33 5.4229222 0.1643310 NA 0.5168925 NA
Total 39 10.4913932 NA NA 1.0000000 NA

Richness and Diversity

Then to assess the alpha diversity of our samples, I calculate the observed richness and shannon diversity index using the vegan package. This is done using the rarefied OTU table that contains singletons. Singletons are included to keep sampling depth standard between samples.

Richness

Observed OTU richness is first calculated for each sample. Then due to the unbalanced nature of our study design, we use a type 3 ANOVA.

Sum Sq Df F value Pr(>F)
(Intercept) 984987.00 1 392.965754 0.0000000
State 159328.33 2 31.782439 0.0000001
clone 36200.00 4 3.610545 0.0186698
State:clone 31511.65 8 1.571467 0.1837637
Residuals 62663.67 25 NA NA
## Warning: not plotting observations with leverage one:
##   31, 37

## Warning: not plotting observations with leverage one:
##   31, 37

Sum Sq Df F value Pr(>F)
(Intercept) 1572843.02 1 551.14041 0.0000000
State 810315.90 2 141.97152 0.0000000
clone 26159.62 4 2.29165 0.0803565
Residuals 94175.31 33 NA NA
diff lwr upr p adj
TN-IN -199.266667 -247.13178 -151.401550 0.0000000
WA-IN -374.000000 -427.51483 -320.485172 0.0000000
WA-TN -174.733333 -228.24816 -121.218506 0.0000000
132-130 -80.522222 -163.73650 2.692059 0.0617624
272-130 -55.333333 -144.29314 33.626476 0.3939450
55-130 -59.238889 -142.45317 23.975393 0.2641783
WT-130 -34.455556 -111.49701 42.585899 0.6990368
272-132 25.188889 -58.02539 108.403171 0.9047050
55-132 21.283333 -55.75812 98.324788 0.9296912
WT-132 46.066667 -24.26224 116.395571 0.3427999
55-272 -3.905556 -87.11984 79.308726 0.9999201
WT-272 20.877778 -56.16368 97.919232 0.9341341
WT-55 24.783333 -45.54557 95.112238 0.8459852

Sum Sq Df F value Pr(>F)
(Intercept) 3855735.0 1 1185.5426 0
State 862382.0 2 132.5805 0
Residuals 120334.9 37 NA NA
diff lwr upr p adj
TN-IN -199.2667 -250.1082 -148.4252 0
WA-IN -374.0000 -430.8425 -317.1575 0
WA-TN -174.7333 -231.5759 -117.8908 0

Diversity

Shannon diversity is first calculated for each sample. Then due to the unbalanced nature of our study design, we use a type 3 ANOVA.

Sum Sq Df F value Pr(>F)
(Intercept) 98.188131 1 417.1432002 0.0000000
State 3.334387 2 7.0829177 0.0036552
clone 1.388572 4 1.4748048 0.2396924
State:clone 1.760811 8 0.9350801 0.5058024
Residuals 5.884558 25 NA NA
## Warning: not plotting observations with leverage one:
##   31, 37

## Warning: not plotting observations with leverage one:
##   31, 37

Sum Sq Df F value Pr(>F)
(Intercept) 163.9754567 1 707.773621 0.0000000
State 14.9876614 2 32.345912 0.0000000
clone 0.9397214 4 1.014039 0.4143667
Residuals 7.6453684 33 NA NA

Sum Sq Df F value Pr(>F)
(Intercept) 424.51279 1 1829.56423 0
State 16.31632 2 35.16002 0
Residuals 8.58509 37 NA NA

Aaron Onufrak

Fall 2019